Abstract



Background: Dermoscopy is one of the major imaging modalities used in the diagnosis of melanoma and other pigmented 
skin lesions. Due to the difficulty and subjectivity of human interpretation, dermoscopy image analysis has become an 
important research area. One of the most important steps in dermoscopy image analysis is the automated detection of 
lesion borders. Although numerous methods have been developed for the detection of lesion borders, very few studies were 
comprehensive in the evaluation of their results. Methods: In this paper, we evaluate five recent border detection methods 
on a set of 90 dermoscopy images using three sets of dermatologist-drawn borders as the ground-truth. In contrast to previous 
work, we utilize an objective measure, the Normalized Probabilistic Rand Index, which takes into account the variations in 
the ground-truth images. Conclusion: The results demonstrate that the differences between four of the evaluated border 
detection methods are in fact smaller than those predicted by the commonly used XOR measure.
1 Introduction

Invasive and in-situ malignant melanoma together comprise one of the most rapidly increasing cancers in the world. Invasive
melanoma alone has an estimated incidence of 62,480 and an estimated total of 8,420 deaths in the United States in 2008 [1].
Early diagnosis is particularly important since melanoma can be cured with a simple excision if detected early.

Dermoscopy, also known as epiluminescence microscopy, is a non-invasive skin imaging technique that uses optical magnification
and either liquid immersion and low angle-of-incidence lighting or cross-polarized lighting, making subsurface structures
more easily visible when compared to conventional clinical images [2]. Dermoscopy allows the identification of dozens of morphological
features such as pigment network, dots/globules, streaks, blue-white areas, and blotches [3]. This reduces screening
errors, and provides greater differentiation between difficult lesions such as pigmented Spitz nevi and small, clinically equivocal 
lesions [3]. However, it has been demonstrated that dermoscopy may actually lower the diagnostic accuracy in the hands 
of inexperienced dermatologists [S]. Therefore, in order to minimize the diagnostic errors that result from the difficulty and 
subjectivity of visual interpretation, the development of computerized image analysis techniques is of paramount importance [BJ. 

Automated border detection is often the first step in the automated or semi-automated analysis of dermoscopy images [Ti- 
It is crucial for the image analysis for two main reasons. First, the border structure provides important information for accurate
diagnosis as many clinical features such as asymmetry, border irregularity, and abrupt border cutoff are calculated directly 
from the border. Second, the extraction of other important clinical features such as atypical pigment network [BJ, globules [5], 
and blue-white areas [5] critically depends on the accuracy of border detection. Automated border detection is a challenging 
task due to several reasons: (i) low contrast between the lesion and the surrounding skin, (ii) irregular and fuzzy lesion 
borders, (iii) artifacts and intrinsic cutaneous features such as black frames, skin lines, blood vessels, hairs, and air bubbles, 
(iv) variegated coloring inside the lesion, and (v) fragmentation due to various reasons such as scar-like depigmentation. 

Numerous methods have been developed for border detection in dermoscopy images [10]. Recent approaches include fuzzy
c-means clustering [11, 12, 13], gradient vector flow snakes [14], thresholding followed by region growing [15, 16], meanshift
clustering [17], color quantization followed by spatial segmentation [18], statistical region merging [19], two-stage k-means++
clustering followed by region merging [20], and contrast enhancement followed by k-means clustering [21]. Some of these
studies used subjective visual examination to evaluate their results. Others used objective measures including Hance et al.'s
XOR measure [22], sensitivity & specificity, precision & recall, error probability, and pixel misclassification probability [23].
These measures require borders drawn by dermatologists, which serve as the ground truth. In this paper, we refer to the 
computer-detected borders as automatic borders and those determined by dermatologists as manual borders. 

In a recent study, Guillod et al. [23] demonstrated that a single dermatologist, even one who is experienced in dermoscopy, 
cannot be used as an absolute reference for evaluating border detection accuracy. In addition, they emphasized that manual 
borders are not precise, with inter-dermatologist borders and even intra-dermatologist borders showing significant disagreement, 
so that a probabilistic model of the border is preferred to an absolute gold-standard model. 

Only a few of the above-mentioned studies used borders determined by multiple dermatologists. Guillod et al. [23] used
fifteen sets of borders determined by five dermatologists over a minimum period of one month. They constructed a probability
image for each lesion by associating a misclassification probability with each pixel based on the number of times it was selected
as part of the lesion. The automatic borders were then compared against these probability images. Iyatomi et al. [15, 16]
modified Guillod et al.'s method by combining the manual borders that correspond to each image into one using the majority
vote rule. The automatic borders were then compared against these combined ground-truth images. Celebi et al. [19] compared
each automatic border against multiple manual borders independently.
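As a rough illustration of the majority vote rule mentioned above, the following Python sketch combines several binary manual masks into one. The function name, the list-of-lists mask representation (1 = lesion, 0 = background), and the strict-majority handling of ties are assumptions made for this example; they are not details taken from the cited studies.

```python
def majority_vote(masks):
    """Combine equally sized binary masks into one mask.

    A pixel is labeled lesion (1) only if more than half of the
    masks label it lesion (strict majority; an assumption here).
    """
    k = len(masks)
    rows = len(masks[0])
    cols = len(masks[0][0])
    combined = []
    for i in range(rows):
        row = []
        for j in range(cols):
            votes = sum(mask[i][j] for mask in masks)
            row.append(1 if 2 * votes > k else 0)
        combined.append(row)
    return combined


# Three hypothetical manual borders for a 1 x 4 image
manual_1 = [[1, 1, 0, 0]]
manual_2 = [[1, 1, 1, 0]]
manual_3 = [[1, 0, 0, 0]]
print(majority_vote([manual_1, manual_2, manual_3]))
```

Note that the combined mask no longer records how strongly the experts agreed on each pixel, which is the loss of information discussed later in the paper.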

In this paper, we evaluate the performance of five recent automated border detection methods on a set of 90 dermoscopy 
images using three sets of manual borders as the ground-truth. In contrast to prior studies, we employ an objective criterion 
that takes into account the variations in the ground-truth images. 

The rest of the paper is organized as follows. Section 2 reviews the objective measures used previously in the border
detection literature. Section 3 describes a recent measure that takes into account the variations in the ground-truth images.
Section 4 presents the experimental setup and discusses the results obtained, while Section 5 concludes the paper.

2 Review of Objective Measures for Border Detection Evaluation 

All of the objective measures mentioned in Section 1, except for Guillod et al.'s probabilistic measure, are based on the concepts
of true/false positive/negative defined in Table 1. For example, if a lesion pixel is detected as part of the background skin,
this pixel is considered to be a False Negative. On the other hand, if a background pixel is detected as part of the lesion, it
is considered to be a False Positive. Note that in the remainder of this paper, True Positive (TP), False Negative (FN), False
Positive (FP), and True Negative (TN) will refer to the number of pixels that satisfy these criteria.



Table 1: Definitions of true/false positive/negative. 'Actual' and 'detected' pixels refer to a pixel in the manual border and
the corresponding pixel in the automatic border, respectively.

|                          | Detected Pixel: Lesion | Detected Pixel: Background |
|--------------------------|------------------------|----------------------------|
| Actual Pixel: Lesion     | True Positive (TP)     | False Negative (FN)        |
| Actual Pixel: Background | False Positive (FP)    | True Negative (TN)         |



2.1 XOR Measure 

The XOR measure, first used by Hance et al. [22], quantifies the percentage border detection error as

$$\mathrm{Error} = \frac{\mathrm{Area}(AB \oplus MB)}{\mathrm{Area}(MB)} \times 100\% = \frac{FP + FN}{TP + FN} \times 100\% \tag{1}$$

where AB and MB are the binary images obtained by filling the automatic and manual borders, respectively, $\oplus$ is the
exclusive-OR (XOR) operation that gives the pixels for which AB and MB disagree, and Area(I) denotes the number of pixels in the
binary image I. The drawback of this composite measure is that it tends to favor larger lesions due to the size term in the
denominator.
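The XOR measure can be sketched in a few lines of Python. This is a minimal illustration, assuming that the filled borders are available as equally sized lists of lists with 1 for lesion and 0 for background; the function names and the toy masks are hypothetical.

```python
def confusion_counts(manual, automatic):
    """Count (TP, FP, FN, TN) between two equally sized binary masks."""
    tp = fp = fn = tn = 0
    for m_row, a_row in zip(manual, automatic):
        for m, a in zip(m_row, a_row):
            if m and a:
                tp += 1      # lesion pixel detected as lesion
            elif a:
                fp += 1      # background pixel detected as lesion
            elif m:
                fn += 1      # lesion pixel detected as background
            else:
                tn += 1      # background pixel detected as background
    return tp, fp, fn, tn


def xor_error(manual, automatic):
    """Percentage XOR error: Area(AB xor MB) / Area(MB) * 100."""
    tp, fp, fn, _ = confusion_counts(manual, automatic)
    return 100.0 * (fp + fn) / (tp + fn)


# Toy 4 x 4 masks: the manual border covers 4 pixels
MB = [[0, 0, 0, 0],
      [0, 1, 1, 0],
      [0, 1, 1, 0],
      [0, 0, 0, 0]]
AB = [[0, 0, 0, 0],
      [0, 1, 1, 1],
      [0, 1, 0, 0],
      [0, 0, 0, 0]]
print(xor_error(MB, AB))  # one FP and one FN over 4 lesion pixels -> 50.0
```

Because the denominator is the manual lesion area, the same two misclassified pixels would yield a much smaller error on a larger lesion, which is the size bias noted above.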



2.2 Sensitivity & Specificity 

Sensitivity (true positive rate) and specificity (true negative rate) are commonly used evaluation measures in medical studies. 
In our application domain, the former corresponds to the percentage of correctly detected lesion pixels, whereas the latter 
corresponds to the percentage of correctly detected background pixels. Mathematically, these measures are given by 

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \times 100\%, \qquad \mathrm{Specificity} = \frac{TN}{FP + TN} \times 100\% \tag{2}$$

Note that an automatic border that encloses the corresponding manual border will have a perfect (100%) sensitivity. On
the other hand, an automatic border that is completely enclosed by the corresponding manual border will have a perfect
specificity. Therefore, it is crucial not to interpret these measures in isolation from each other.

2.3 Precision & Recall 

Precision (positive predictive value) and recall are commonly used evaluation measures in information retrieval studies.
Precision refers to the percentage of correctly detected lesion pixels over all the pixels detected as part of the lesion and is defined as

$$\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\% \tag{3}$$

Recall is equivalent to sensitivity as defined in (2). Note that, as in the case of sensitivity and specificity, the precision and recall
measures should be interpreted together.
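The following sketch shows why the measures of Sections 2.2 and 2.3 should be read in pairs. It assumes the pixel counts TP, FP, FN, and TN are already known; the function name and the two sets of counts are hypothetical values chosen for illustration.

```python
def rates(tp, fp, fn, tn):
    """Return (sensitivity, specificity, precision) as percentages."""
    sensitivity = 100.0 * tp / (tp + fn)   # identical to recall
    specificity = 100.0 * tn / (fp + tn)
    precision = 100.0 * tp / (tp + fp)
    return sensitivity, specificity, precision


# Automatic border encloses the manual border: no lesion pixel is missed
enclosing = rates(tp=200, fp=200, fn=0, tn=600)

# Automatic border lies entirely inside the manual border: no false alarms
enclosed = rates(tp=100, fp=0, fn=100, tn=800)

print(enclosing)  # (100.0, 75.0, 50.0)
print(enclosed)   # (50.0, 100.0, 100.0)
```

In both cases one measure of each pair is perfect while its companion exposes the error.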

2.4 Error Probability 

Error probability refers to the percentage of pixels incorrectly detected as part of the lesion or background over all the pixels. 
It is calculated as 

$$\mathrm{Error\ probability} = \frac{FP + FN}{TP + FN + FP + TN} \times 100\% \tag{4}$$

The drawback of this composite measure is that it disregards the distributions of the classes. For example, consider a small
lesion of size 20,000 pixels in a large image of size 768 × 512 pixels. An automatic border of size 40,000 pixels that encloses
the manual border for this lesion will have an error probability of about 5% despite the fact that the automatic border is twice
as large as the manual border.
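The arithmetic of this example can be checked with a short Python sketch. The function name is hypothetical, and the code assumes, as in the example, that the automatic border fully encloses the manual border. The XOR error of equation (1) is computed alongside for comparison.

```python
def enclosing_border_errors(image_pixels, manual_pixels, automatic_pixels):
    """Error probability and XOR error (in %) for an automatic border
    that fully encloses the manual border."""
    tp = manual_pixels                      # every lesion pixel is detected
    fn = 0                                  # no lesion pixel is missed
    fp = automatic_pixels - manual_pixels   # extra pixels outside the lesion
    tn = image_pixels - automatic_pixels    # background left as background
    error_probability = 100.0 * (fp + fn) / (tp + fn + fp + tn)
    xor_error = 100.0 * (fp + fn) / (tp + fn)
    return error_probability, xor_error


ep, xor = enclosing_border_errors(768 * 512, 20000, 40000)
print(f"error probability = {ep:.2f}%, XOR error = {xor:.1f}%")
# error probability = 5.09%, XOR error = 100.0%
```

The same border thus scores roughly 5% under one measure and 100% under the other, because the error probability is dominated by the large number of correctly classified background pixels.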

2.5 Pixel Misclassification Probability 

In [23] the probability of misclassification for a pixel is defined as

$$P(i,j) = 1 - \frac{n(i,j)}{N} \tag{5}$$

where N is the number of observations (manual + automatic borders), and n(i, j) is the number of times pixel (i, j) was selected
as part of the lesion. For each automatic border, the detection error is given by the mean probability of misclassification over
the pixels inside the border

$$\mathrm{Error} = \frac{\sum_{(i,j) \in AB} P(i,j)}{TP + FP} \times 100\% \tag{6}$$
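A minimal Python sketch of equations (5) and (6) is given below. It assumes that all observations are equally sized binary masks and that the caller decides which masks count as observations; here the automatic border is included among them, which is one reading of "manual + automatic borders". The function name and toy masks are hypothetical.

```python
def misclassification_error(observations, automatic):
    """Mean misclassification probability (in %) over the pixels
    inside the automatic border.

    observations: list of binary masks counted in N (an assumption:
                  the automatic mask is passed in here as well).
    automatic:    binary mask of the automatic border.
    """
    n_obs = len(observations)
    total = 0.0
    inside = 0  # number of pixels inside the automatic border (TP + FP)
    for i, row in enumerate(automatic):
        for j, a in enumerate(row):
            if a:
                n_ij = sum(obs[i][j] for obs in observations)
                total += 1.0 - n_ij / n_obs   # P(i, j)
                inside += 1
    return 100.0 * total / inside


# Three hypothetical manual borders and one automatic border (1 x 4 image)
manual_1 = [[1, 1, 0, 0]]
manual_2 = [[1, 1, 1, 0]]
manual_3 = [[1, 0, 0, 0]]
automatic = [[1, 1, 1, 0]]
error = misclassification_error([manual_1, manual_2, manual_3, automatic],
                                automatic)
print(error)  # P values 0, 0.25, 0.5 averaged over 3 pixels -> 25.0
```

Since the average runs only over pixels inside the automatic border, lesion pixels that the automatic border misses do not contribute to the error.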

2.6 Error Measures Used in Previous Studies 

Table 2 compares recent border detection methods based on their evaluation methodology: the number of human experts who
determined the manual borders, the number of images used in the evaluations (and the diagnostic distribution of these images 
if available), and the measure used to quantify the border detection error. It can be seen that: 

• Recent studies used objective measures to validate their results, whereas earlier studies relied on visual assessment. 

• Only 5 out of 19 studies involve more than one expert in the evaluation of their results. 

• XOR measure is the most commonly used objective error function despite the fact that it is not trivial to extend this 
measure to capture the variations in multiple manual borders. 



3 Proposed Measure for Border Detection Evaluation 

The objective measures reviewed in the previous section share a common deficiency. They do not take into account the 
variations in the manual borders. Given an automatic border, the XOR measure, sensitivity & specificity, precision & recall, 
and error probability can only be defined with respect to a single manual border. Therefore, it is not possible to use these 
measures with multiple manual borders. Although the methods described in [23], [15, 16], and [19] allow the use of multiple
manual borders, these methods do not accurately capture the variations in the manual borders. For example, using Guillod
et al.'s measure, an automated border that is entirely enclosed by the manual borders would get a very low error. Iyatomi et
al.'s method discounts the variation in the manual borders by simple majority voting, while Celebi et al.'s approach does not
produce a scalar error value, which makes comparisons more difficult.
Table 2: Evaluation of border detection methods (b: benign, m: melanoma)

| Ref. | Year | # Experts | # Images (Distribution) | Error Measure (Value)        |
|------|------|-----------|-------------------------|------------------------------|
| [13] | 2009 | 1         | 100 (70 b / 30 m)       | Sens. (78%) & Spec. (99%)    |
| [?]  | 2008 | 3         | 90 (65 b / 25 m)        | XOR (10.63%)                 |
| [?]  | 2008 | 1         | 67                      | XOR (14.63%)                 |
| [?]  | 2008 | 1         | 100 (70 b / 30 m)       | XOR (2.73%)                  |
| [?]  | 2007 | 1         | 50                      | Error probability (16%)      |
| [?]  | 2007 | 1         | 50                      | Error probability (21%)      |
| [18] | 2007 | 2         | 100 (70 b / 30 m)       | XOR (12.02%)                 |
| [15] | 2006 | 5         | 319 (244 b / 75 m)      | Prec. (94.1%) & Rec. (95.2%) |
| [?]  | 2006 | nr        | 117                     | Sens. (95%) & Spec. (96%)    |
| [?]  | 2005 | 2         | 100 (70 b / 30 m)       | XOR (15.59%)                 |
| [25] | 2003 |           | nr                      | nr                           |
| [?]  | 2002 |           | 600                     | Visual                       |
| [26] | 2001 |           | nr                      | nr                           |
| [?]  | 2000 | 5         | 30                      | Visual                       |
| [28] | 2000 | 1         | 30                      | Visual                       |
| [?]  | 1999 | 1         | 400                     | Visual                       |
| [29] | 1999 | 1         | 300                     | Visual                       |
| [30] | 1998 | 1         | 57                      | XOR (36.50%)                 |
| [30] | 1998 | 1         | 57                      | XOR (24.71%)                 |


In this paper we propose to use a recent, more elaborate probabilistic measure, namely the Normalized Probabilistic Rand
Index (NPRI) [31], to evaluate border detection accuracy. We first describe the Probabilistic Rand Index (PRI) [32]. Consider a
set of manual segmentations $\{S_1, \ldots, S_K\}$ of an image $X = \{x_1, \ldots, x_N\}$ consisting of $N$ pixels. Let $S_{test}$ be the segmentation
that is to be compared with the manually labeled set. We denote the label of point $x_i$ by $l_i^{S_{test}}$ in segmentation $S_{test}$ and by
$l_i^{S_k}$ in the manually segmented image $S_k$.

The motivation behind the PRI is that a segmentation is judged as 'good' if it correctly identifies the pairwise relationships 
between the pixels as defined in the ground truth segmentations. In addition, a proper segmentation quality measure should 
penalize inconsistencies between the test and ground-truth label pair relationships proportionally to the level of consistency 
between the ground-truth label pair relationships. Based on this, the PRI is defined as 

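$$\mathrm{PRI}(S_{test}, \{S_k\}) = \frac{1}{\binom{N}{2}} \sum_{i,j \,:\, i<j} \left[ c_{ij}\, p_{ij} + (1 - c_{ij})(1 - p_{ij}) \right] \tag{7}$$

where $c_{ij} \in \{0, 1\}$ denotes the event of a pair of pixels $x_i$ and $x_j$ having the same label in the test image $S_{test}$,

$$c_{ij} = I\left(l_i^{S_{test}} = l_j^{S_{test}}\right) \tag{8}$$

and $I(\cdot)$ is a boolean function defined as

$$I(t) = \begin{cases} 1 & t = \text{true} \\ 0 & t = \text{false} \end{cases}$$

Note that the denominator in (7) denotes the number of possible distinct pixel pairs. Given the $K$ manually labeled images,
we can compute the empirical probability of the label relationship of a pixel pair $x_i$ and $x_j$ by

$$p_{ij} = \frac{1}{K} \sum_{k=1}^{K} I\left(l_i^{S_k} = l_j^{S_k}\right) \tag{9}$$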

The PRI is always within the interval [0, 1], and an index of 0 or 1 can only be achieved when all of the ground-truth
segmentations agree or disagree on every pixel pair relationship. A score of 0 indicates that every pixel pair in the test image
has the opposite relationship as every pair in the ground-truth segmentations, while a score of 1 indicates that every pixel pair
in the test image has the same relationship as every pair in the ground-truth segmentations.
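The PRI of equations (7) to (9) can be sketched directly from the definitions. The version below is a brute-force illustration that visits every pixel pair, so it is only practical for very small inputs; it assumes each segmentation is given as a flat list of labels of equal length, and the function name and toy data are hypothetical.

```python
from itertools import combinations


def pri(test_labels, manual_label_sets):
    """Probabilistic Rand Index of a test segmentation against
    K manual segmentations (all flat label lists of length N)."""
    n = len(test_labels)
    k = len(manual_label_sets)
    total = 0.0
    pairs = 0  # ends up as N choose 2
    for i, j in combinations(range(n), 2):
        # c_ij: do pixels i and j share a label in the test segmentation?
        c = 1 if test_labels[i] == test_labels[j] else 0
        # p_ij: fraction of manual segmentations in which they share a label
        p = sum(1 for s in manual_label_sets if s[i] == s[j]) / k
        total += c * p + (1 - c) * (1 - p)
        pairs += 1
    return total / pairs


# One test segmentation and three hypothetical manual segmentations
test = [1, 1, 0, 0]
manuals = [[1, 1, 0, 0],
           [1, 1, 1, 0],
           [1, 0, 0, 0]]
score = pri(test, manuals)
print(round(score, 4))  # 0.6667
```

Only pairwise label relationships enter the computation, so swapping the label values (for example, calling the lesion 0 and the background 1) leaves the score unchanged.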

The PRI has one disadvantage. Although the index values are in [0, 1], there is no expected value for a given segmentation. 
That is, it is impossible to know if any given score is good or bad. In addition, the score of a segmentation of one image cannot 
be compared with the score of a segmentation of another image. The Normalized Probabilistic Rand Index (NPRI) addresses 
this drawback by normalizing the PRI as follows 

$$\text{Normalized Index} = \frac{\text{Index} - \text{Expected Index}}{\text{Maximum Index} - \text{Expected Index}} \tag{10}$$

Table 3: XOR measure statistics: mean (standard deviation)

| Dermatologist | Diagnosis | OSFCM           | DTEA           | MS             | JSEG           | SRM            |
|---------------|-----------|-----------------|----------------|----------------|----------------|----------------|
| WS            | Benign    | 22.995 (12.614) | 10.513 (4.728) | 11.527 (9.737) | 10.832 (6.359) | 11.384 (6.232) |
| WS            | Melanoma  | 28.311 (15.245) | 11.853 (5.998) | 13.292 (7.418) | 13.745 (7.590) | 10.294 (5.838) |
| WS            | All       | 24.354 (13.449) | 10.855 (5.081) | 11.978 (9.193) | 11.577 (6.772) | 11.106 (6.120) |
| JM            | Benign    | 25.535 (11.734) | 10.367 (3.771) | 10.802 (6.332) | 10.816 (5.227) | 10.186 (5.683) |
| JM            | Melanoma  | 26.743 (14.508) | 10.874 (5.016) | 12.592 (7.202) | 12.981 (6.316) | 10.500 (8.137) |
| JM            | All       | 25.843 (12.426) | 10.496 (4.101) | 11.259 (6.571) | 11.370 (5.570) | 10.266 (6.351) |
| JG            | Benign    | 27.506 (12.789) | 12.091 (5.220) | 12.224 (7.393) | 12.257 (6.588) | 10.561 (5.152) |
| JG            | Melanoma  | 27.574 (15.836) | 12.675 (6.865) | 12.168 (7.479) | 13.414 (7.379) | 10.411 (5.860) |
| JG            | All       | 27.523 (13.538) | 12.240 (5.650) | 12.210 (7.373) | 12.553 (6.775) | 10.523 (5.308) |



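The maximum index is taken as 1 while the expected value of the index is calculated as

$$E\left[\mathrm{PRI}(S_{test}, \{S_k\})\right] = \frac{1}{\binom{N}{2}} \sum_{i,j \,:\, i<j} \left[ p'_{ij}\, p_{ij} + (1 - p'_{ij})(1 - p_{ij}) \right] \tag{11}$$

Let $\Phi$ be the number of images in the entire data set, and $K_\phi$ be the number of ground-truth segmentations of image $\phi$.
Then $p'_{ij}$ can be expressed as

$$p'_{ij} = \frac{1}{\Phi} \sum_{\phi} \frac{1}{K_\phi} \sum_{k=1}^{K_\phi} I\left(l_i^{S_k^\phi} = l_j^{S_k^\phi}\right) \tag{12}$$
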

Since in the computation of the expected values no assumptions are made with regards to the number or size of regions in 
the segmentation, and all of the ground-truth data is used, the NPR indices are comparable across images and segmentations. 
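The normalization in equations (10) to (12) can be sketched as follows. This is a simplified illustration rather than the reference procedure: it assumes that every image in the data set has the same number of pixels, so that a pixel pair (i, j) can be indexed identically across images, and it enumerates all pairs by brute force. The function names and toy data are hypothetical.

```python
from itertools import combinations


def same_label_fraction(label_sets, i, j):
    """Fraction of segmentations in which pixels i and j share a label."""
    return sum(1 for s in label_sets if s[i] == s[j]) / len(label_sets)


def npri(test_labels, manual_sets, dataset):
    """Normalized PRI of a test segmentation.

    manual_sets: manual segmentations of the image under test.
    dataset:     one list of manual segmentations per image in the data
                 set (all of length N; a simplifying assumption).
    """
    index = expected = 0.0
    pairs = 0
    for i, j in combinations(range(len(test_labels)), 2):
        c = 1.0 if test_labels[i] == test_labels[j] else 0.0
        p = same_label_fraction(manual_sets, i, j)
        # p'_ij: average of the per-image same-label fractions, eq. (12)
        p_prime = sum(same_label_fraction(img, i, j) for img in dataset) / len(dataset)
        index += c * p + (1 - c) * (1 - p)
        expected += p_prime * p + (1 - p_prime) * (1 - p)
        pairs += 1
    index /= pairs
    expected /= pairs
    # Maximum index is taken as 1; undefined if expected equals 1
    return (index - expected) / (1.0 - expected)


# Toy data set of two 4-pixel images with one manual segmentation each
image_a = [[1, 1, 0, 0]]
image_b = [[1, 0, 0, 0]]
dataset = [image_a, image_b]
print(npri([1, 1, 0, 0], image_a, dataset))  # 1.0
print(npri([1, 0, 0, 0], image_a, dataset))  # -1.0
```

In this toy case a test segmentation that matches the ground truth reaches the maximum of 1, while one that does worse than the expected index receives a negative score, which gives the value an interpretable reference point.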



4 Experimental Results and Discussion 

The proposed evaluation method was tested on a set of 90 dermoscopy images (23 invasive malignant melanoma and 67 benign)
obtained from the EDRA Interactive Atlas of Dermoscopy [5] and three private dermatology practices [19]. The benign lesions
included nevocellular nevi and dysplastic nevi.

Manual borders were obtained by selecting a number of points on the lesion border, connecting these points by a second-order
B-spline, and finally filling the resulting closed curve. Three sets of manual borders were determined by dermatologists
Dr. William Stoecker (WS), Dr. Joseph Malters (JM), and Dr. James Grichnik (JG) using this method.

Five recent automated border detection methods were included in the experiments. These were the orientation-sensitive fuzzy
c-means method (OSFCM) [11], the dermatologist-like tumor extraction algorithm (DTEA) [15, 16], the meanshift clustering method
(MS) [17], the modified JSEG method (JSEG) [18], and the statistical region merging method (SRM) [19]. Table 3 gives the mean
and standard deviation errors as evaluated by the commonly used XOR measure (1). The best results, i.e. the lowest mean
errors, in each row are shown in bold.

It can be seen that the results vary significantly across the border sets, highlighting the subjectivity of human experts in 
the border determination procedure. Overall, the SRM method achieves the lowest mean errors followed by the DTEA and 
JSEG methods. It should be noted that, with the exception of SRM, the error rates increase in the melanoma group which is 
possibly due to the presence of higher border irregularity and color variation in these lesions. With respect to consistency, the 
best methods are DTEA followed by the SRM and JSEG methods. 

Table 4 shows the border detection quality statistics as evaluated by the proposed NPRI measure. Note that, in this table,
higher mean values indicate lower border detection errors, whereas higher standard deviation values indicate lower consistency.

It can be seen that the ranking remains the same: SRM and DTEA are still the most accurate and consistent methods.
However, using the NPRI measure, the differences between the methods have become smaller. In addition, this measure
considers the variations in the manual borders simultaneously and produces a scalar value, which makes comparisons among
methods much easier.

Table 4: NPRI measure statistics: mean (standard deviation)

| Diagnosis | OSFCM         | DTEA          | MS            | JSEG          | SRM           |
|-----------|---------------|---------------|---------------|---------------|---------------|
| Benign    | 0.520 (0.247) | 0.785 (0.079) | 0.774 (0.137) | 0.775 (0.114) | 0.785 (0.109) |
| Melanoma  | 0.520 (0.258) | 0.783 (0.108) | 0.762 (0.161) | 0.748 (0.141) | 0.811 (0.092) |
| All       | 0.520 (0.248) | 0.784 (0.087) | 0.771 (0.142) | 0.768 (0.122) | 0.791 (0.105) |

Figure 1 illustrates one advantage of using the NPRI measure. Here the manual borders are shown in red, green, and
blue, whereas the border determined by the DTEA method is shown in black. The border detection errors with respect to
the red, green, and blue borders calculated using the XOR measure are 10.872%, 9.342%, and 20.958%, respectively. It can
be concluded that, with respect to the first two dermatologists, the DTEA method has an average accuracy (see Table 3). On
the other hand, with respect to the third dermatologist, the automatic method is quite inaccurate. The NPRI value in this
case is 0.814, which is above the average over the entire data set (see Table 4). This was expected, since this measure does not
penalize the automatic border in those regions where dermatologist agreement is low.




Figure 1: Sample border detection result 



5 Conclusions and Future Work

In this paper, we evaluated five recent automated border detection methods on a set of 90 dermoscopy images using three sets 
of manual borders as ground-truth. We proposed the use of an objective measure, the Normalized Probabilistic Rand Index, 
which takes into account variations in the ground-truth. The results demonstrated that the differences between four of the 
evaluated border detection methods were in fact smaller than those predicted by the commonly used XOR measure. Future 
work will be directed towards the expansion of the image set and the inclusion of more dermatologists in the evaluations. 